To read and write files, Python provides a built-in function called open, which takes a path/filename and a mode.
In Python, a file can be opened in the following modes:
By default, a file is opened in text mode, that is to say the contents of the file are interpreted as text. If you want to handle the file as binary data, you can open it in binary mode: "rb", "wb", or "ab", depending on whether you want to read, write, or append.
The readline() function reads from the file until a newline character "\n" is reached and returns that string. A file object is also iterable, so you can iterate over its lines with a for statement.
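For instance, a minimal sketch of both approaches (the filename lines_example.txt is just an illustration, created on the spot):

```python
# create a small example file to read from
with open("lines_example.txt", "w") as fp:
    fp.write("first\nsecond\n")

# readline() returns one line, including the trailing "\n"
with open("lines_example.txt") as fp:
    first_line = fp.readline()
print(repr(first_line))  # 'first\n'

# a file object is iterable, yielding one line per iteration
with open("lines_example.txt") as fp:
    for line in fp:
        print(line.strip())
```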
You should always remember to close any files you have opened; the with statement, shown below, does this automatically.
In [ ]:
filename = "../data/example_file.txt"
fp = open(filename, "w")
for string in ["Hello", "Hey", "moi"]:
    fp.write(string + "\n")
fp.close()

fp = open(filename, "r")
for line in fp:
    # the "\n" is contained in the line; strip() removes whitespace at both ends
    print(line.strip())
fp.close()
In [ ]:
with open(filename, "r") as file:
    for line in file:
        print(line.strip())
The Python standard library contains modules for dealing with gzip (.gz), bzip2 (.bz2), and the less commonly used LZMA compression algorithms. Additionally there is support for opening ZIP and tar archives, which may contain multiple files. More information can be found in the documentation.
There are more tools in the Python Package Index for many other formats.
The beauty of handling compressed files is that the abstraction is essentially the same as for working with an uncompressed file.
In [ ]:
import gzip

# the gzip module offers an open()-like API, see https://docs.python.org/3/library/gzip.html
zipped_file_name = "../data/zipped_file.gz"
with gzip.open(zipped_file_name, "wt") as zipped_file:
    for line in ["This", "is", "an", "example", "."]:
        zipped_file.write(line + "\n")
In [ ]:
## Go ahead: try to read the lines from zipped_file_name and print them.
## It works just like in the examples above, except with gzip.open instead of open.
## As this is text, you'll need to open the file in mode "rt", not just "r".
Implementation details like the "rt" vs. "r" mode vary a bit between modules; check the documentation when unsure.
Data can be stored in myriad ways.
A very common one is the so-called Comma-Separated Values (CSV) format:
header1,header2
1,0
0,1
1,0
Another common one is JSON:
{"key": "value", "key2": "value2"}
XML is, of course, an alternative:
<?xml version="1.0"?>
<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
    </country>
</data>
Full XML examples would require more in-depth knowledge of XML than this notebook can cover. Suffice it to say that handling XML files is possible.
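As a small taste, a minimal sketch using the standard library's xml.etree.ElementTree to parse the snippet above:

```python
import xml.etree.ElementTree as ET

xml_text = """<?xml version="1.0"?>
<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
    </country>
</data>"""

# parse the document from a string; ET.parse() would read from a file instead
root = ET.fromstring(xml_text)
for country in root.findall("country"):
    # attributes via get(), child element text via find(...).text
    print(country.get("name"), country.find("rank").text)
```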
Many software packages read their configuration from files in the INI format:
[default]
value = 5
[special_configs]
bigger_value = 6
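A minimal sketch of reading such a configuration with the standard library's configparser, using the section and key names from the snippet above:

```python
import configparser

config = configparser.ConfigParser()
# read_string() parses an in-memory string; config.read() would take a filename
config.read_string("""
[default]
value = 5
[special_configs]
bigger_value = 6
""")

# values come back as strings; use getint() / getfloat() / getboolean() to convert
print(config["default"]["value"])
print(config["special_configs"].getint("bigger_value"))
```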
These four formats are mentioned as examples because the Python standard library contains a module for each of them: csv, json, xml.etree.ElementTree, and configparser.
Of special interest is also Python's own pickle module, designed expressly for serializing Python objects.
In [ ]:
# we will use this object throughout the examples to illustrate different file format handling
pythons = [
    {"name": "Graham Chapman", "birthyear": 1941, "dead": True},
    {"name": "Eric Idle", "birthyear": 1943, "dead": False},
    {"name": "Terry Gilliam", "birthyear": 1940, "dead": False},
    {"name": "Terry Jones", "birthyear": 1942, "dead": False},
    {"name": "John Cleese", "birthyear": 1939, "dead": False},
    {"name": "Michael Palin", "birthyear": 1939, "dead": False},
]
The comma-separated values format seems deceptively simple at first, and a casual reader can be tempted to write a parser themselves.
"What could possibly go wrong? It's a really simple format after all."
- every starting developer at least once in their career
The number of different conventions makes parsing arbitrary CSV files highly nontrivial, so it is good that there is a dedicated library for the purpose.
There are two simple ways to use the built-in csv library:
Without headers, using reader and writer
With headers, using DictReader and DictWriter, which treat each row as a dict
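The header-based style is used in the examples that follow; for the headerless style, here is a minimal sketch using csv.writer and csv.reader on an in-memory buffer (the data is hypothetical):

```python
import csv
import io

# write a few headerless rows into an in-memory buffer
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["header1", "header2"])
writer.writerows([[1, 0], [0, 1]])

# rewind and read the rows back; every cell comes back as a string
buf.seek(0)
rows = list(csv.reader(buf))
print(rows)
```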
In [ ]:
import csv
filename = "../data/example.csv"
# newline="" is recommended by the csv documentation to avoid extra blank rows on some platforms
with open(filename, "w", newline="") as file_:
    writer = csv.DictWriter(file_, fieldnames=["name", "birthyear", "dead"])
    writer.writeheader()
    for performer in pythons:
        writer.writerow(performer)
In [ ]:
import datetime

def print_performer_dict(performer):
    this_year = datetime.datetime.now().year
    # CSV gives everything back as strings, so booleans and numbers need converting
    if performer["dead"].lower() == "true":
        print("%s is dead" % performer["name"])
    else:
        print("%s turns %d this year" % (performer["name"],
                                         this_year - int(performer["birthyear"])))

with open(filename, "r") as file_:
    reader = csv.DictReader(file_)
    for performer in reader:
        print_performer_dict(performer)
Note how the truth value and the number needed a bit of tinkering. This is one of the downsides of the CSV format: there is no agreed-upon way to mark which values are strings, which are numbers, and which are booleans, so everything is read back as a string.
JSON is the data interchange format of the web age. Like CSV, it has several flaws, yet it is widely used.
In Python, one can usually convert dicts to JSON objects and lists to JSON lists, with some minor caveats. JSON has no set type, so Python sets must be converted to lists, and JSON object keys must be strings, whereas Python permits any immutable object as a dictionary key. Also, there is no agreed-upon way to encode dates and times in JSON (ISO 8601 strings for human-readable dates, and possibly Unix timestamps for machine-readable ones, are recommended).
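A quick sketch of these caveats in action:

```python
import json
import datetime

# sets are not JSON-serializable; convert to a list first
try:
    json.dumps({1, 2, 3})
except TypeError:
    print("sets must be converted to lists first")
print(json.dumps(sorted({1, 2, 3})))  # a plain JSON list

# there is no standard date type either; ISO 8601 strings are a common convention
moment = datetime.datetime(2020, 1, 1, 12, 0)
encoded = json.dumps({"timestamp": moment.isoformat()})
print(encoded)
```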
Also, the default json library may not be optimal in every respect. There are third-party alternatives, such as simplejson and ujson.
The different libraries handle corner cases differently, so it is usually not a good idea to use JSON as a persistence format between multiple Python programs.
However, the requirement does not arise very often when dealing with Internet-based systems.
The dump and load functions operate directly on files and take a file-like object as a parameter. The dumps and loads functions return and read a string, which is what the s stands for.
In [ ]:
import json

# we have two strategies: store the entire object as one JSON document, or store
# each row as a separate JSON object; both exist in the wild, so both will be shown
# fortunately our dicts only contain very simple values, so there will be no issues
ex_1_file = "../data/example_json_1.json"
ex_2_file = "../data/example_json_2.json"

with open(ex_1_file, "w") as file_:
    json.dump(pythons, file_)

with open(ex_2_file, "w") as file_:
    for performer in pythons:
        json.dump(performer, file_)
        file_.write("\n")
In [ ]:
# reading back
import datetime

def print_performer_dict_2(performer):
    this_year = datetime.datetime.now().year
    # JSON preserves booleans and numbers, so no string conversions are needed here
    if performer["dead"]:
        print("%s is dead" % performer["name"])
    else:
        print("%s turns %d this year" % (performer["name"],
                                         this_year - performer["birthyear"]))

with open(ex_1_file, "r") as file_:
    data = json.load(file_)
    for performer in data:
        print_performer_dict_2(performer)

print("####")

with open(ex_2_file, "r") as file_:
    for line in file_:
        performer = json.loads(line)
        print_performer_dict_2(performer)
The simplest way to store Python objects is pickle. It is the standard way to serialize and deserialize Python objects.
Pickle serializes Python objects into byte strings that can be unpickled by other Python processes and threads. It can pickle almost any data Python can represent. The tricky part is ensuring that both Python processes use compatible versions (newer pickle protocols cannot be read by older Pythons) and that they have the same versions of all relevant libraries.
Another caveat is that other programming languages don't support pickle; it's Python-only. Also, never unpickle data from an untrusted source, as unpickling can execute arbitrary code.
In [ ]:
import pickle

pickled_pythons = pickle.dumps(pythons)  # pickle also has dump and dumps, like json
# we could write pickled_pythons to a file here, but that's not really the point of the exercise
unpickled_pythons = pickle.loads(pickled_pythons)
print(str(pythons) == str(unpickled_pythons))
The beautiful thing about pickle is that it will serialize complex objects and deserialize them back into equivalent objects.
For example, when training classifiers and regressors in machine learning, one can train on a powerful computer for a long time until the algorithm converges, pickle the resulting object (classifier or regressor), and distribute it to other machines.
Many Python libraries operate on files or strings. Some library writers assume that everyone will always want to pass a file to their library. Others assume that the results should always be written to a file on the filesystem even when that is not strictly necessary.
For that purpose the io library in Python offers tools to create objects that look like files, even when they aren't.
There are two classes: StringIO for text data and BytesIO for binary data.
In [ ]:
import io

my_output = io.StringIO()
writer = csv.DictWriter(my_output, fieldnames=["name", "birthyear", "dead"])
writer.writeheader()
writer.writerows(pythons)

file_contents = my_output.getvalue()
print("file contents would have been:\n")
print(file_contents)
print("---")

# let's construct another StringIO and use csv to read from a string instead of a file
my_input = io.StringIO(file_contents)
reader = csv.DictReader(my_input)
for line in reader:
    print_performer_dict(line)
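BytesIO plays the same role for binary data. A minimal sketch that compresses data entirely in memory by handing a BytesIO buffer to gzip (the payload here is arbitrary):

```python
import io
import gzip

# gzip.GzipFile can write into any file-like object, here an in-memory buffer
buffer = io.BytesIO()
with gzip.GzipFile(fileobj=buffer, mode="wb") as gz:
    gz.write(b"hello in-memory compression")

# the buffer now holds the compressed bytes; no file on disk was needed
compressed = buffer.getvalue()
decompressed = gzip.decompress(compressed)
print(decompressed.decode())
```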